home *** CD-ROM | disk | FTP | other *** search
- # How long a time lies in one little word! -- Shakespeare, Richard II, I, iii
-
- # Fine words butter no parsnips. -- Southern proverb
-
- [ef]?grep Implementation Changes
-
- 'grep' r.e. translation:
-
- To buy speed for the novice 'grep' user who deigns not to learn the
- extended 'egrep' syntax, we translate 'grep' r.e.'s to the 'egrep' superset.
- It is straightforward enough to surround search patterns meaningful to
- 'egrep' but not to 'grep'. Odd cases include the -w option, not implemented
- in standard 'egrep', the defunct -y option, and "tagged expressions", which
- are done via an exec() of /bin/grep. Tagged exprs, like
-
- grep '\(....\).*\1' /usr/dict/words
-
- which outputs chaff like "beriberi", "couscous", "hodgepodge", and
- "lightweight", are weird. The irregularity these exprs lend coupled with
- a low complexity/utility ratio kept them from being part of 'egrep'.
- But for this feature, old 'grep' code could be thrown away.
-
- 'fgrep' improvement / (partial) unification:
-
- In the new release, we trap low-complexity disjunctions such as
-
- egrep 'boyer|moore' file
- or
- fgrep 'boyer\n
- moore' file
-
- (or with "-f patfile" in place of the pattern) with a method to superimpose
- the non-terminals within the Boyer/Moore table. When scanning text backwards,
- other programming tricks short-circuit some tests against the pattern.
- Sparing further details, which might make for a more formal writeup, it
- suffices to say that although worst-case complexity here is O(Rn) with string
- length 'n', and R == r.e. size, average-case for text is still sublinear. E.g.
-
- egrep 'silver|orange|purple' # hard-to-rhyme color test in eg.shell
-
- looks at ~55000 chars in /usr/dict/words, whereas (three) separate invocations
- of egrep on the individual color words make the code look at ~40000 bytes per
- word. Aho/Corrasick's 'fgrep', in contrast, must look at all 200KB in the
- dictionary. The elegant "trie" construction within "fgrep" excels, however,
- for medium/large R. An equally ambitious "reverse trie", supplementing our
- extant "alternation mush", would attenuate worst-case behavior while preserving
- low R speedup. We save the addition for another day.
-
- Since the syntax for [ef]grep is similar, we thought of making egrep
- hand off to fgrep for sufficiently large metacharacter-free R, as there is no
- strong reason to make the user conscious of the separate algorithms. Certain
- technicalities prevent this. For one, we are not willing to invent an 'egrep'
- option to inform the code to interpret a file of (say a hundred) word
- alternatives containing some innocent metacharacter, that it is literal
- 'fgrep' input, rather than a closure-containing 'egrep' pattern which would
- otherwise make egrep explode. More work could be done here.
-
- Our motivation? Is this not all overblown? Perhaps, but now you can
- build a simple fast "NSA filter", or search for the seven dwarfs at leisure.
- Besides, the final nail needed to be driven into 'bm/match', which tried
- to do parallel match, but actually shuffled things out of order during its
- simplistic block-based scheme. These programs, part of source archive also,
- are now historical curiosities.
-
- Kanji egrep:
-
- Copious notes are in README.kanji.mods. The March 1987 Unix Review
- was indispensable for pointing out the embedded "SS2" Katakana pitfalls.
- The modularity of our code as a semi-dependent filter was necessary for this
- exploration, as we have no access to AT&T/Unisoft Kanji code. Again, JUNET
- or Sigma project people -- please respond with grep war stories or usage notes.
-
- Worst-case r.e. handling:
-
- The first code release elaborately called upon a function mcilroy()
- to pipe partial match output to old egrep for tough expressions, whose
- backtracking might swamp regexp(). Some details of the DFA/NDFA tradeoff
- were discussed in pep4grep.doc[12]. Due largely to feedback from John Gilmore
- of the GNU project, the strategy was revised. egrep.c function kernighan()
- ("let others do the hard part") now usurps calls to costly popen() by invoking
- exec() on old egrep when necessary. Rough partial match statistics gathered
- on the fly determine the handoff. You may revise the time reported previously
- for
- egrep 'hoe.*g' /usr/dict/words
-
- from 1.2 user, 1.1 sys seconds (VAX 11/750, 4.3BSD, Fuji disks) to 0.8u, 0.4s.
- For those public-spirited souls who really want to build a PD egrep out of
- what we offer, sans fallback from regexp() to an AT&T /usr/bin/egrep, the
- slippery test "egrep 'g+h+o' /usr/dict/words" will prove enlightening.
-
- Faster -n option:
-
- By popular demand. Though Boyer/Moore techniques subvert line numbering,
- we've made it faster with brute force (loop unrolling helps VAXEN, but not
- CRISPS). Timing tests for this and other options appear in the eg.shell script.
-
- Not so fast:
-
- -v, -b, -w, various r.e.'s with no rexexp() "residue"
-
- (you'll still have to use the layered "grep host /etc/hosts | grep -w host"
- for speed.)
-
- Other contra-indications for new [ef]?grep:
-
- Monster patterns
-
- The amazing expressions output by /usr/lib/calendar still beg for
- the lazy evaluation technique rolled into edition 8 egrep by Prof. Aho of
- Princeton. Hinted at on p. 105 in Kernighan & Pike, lazy evaluation reduces
- standard egrep r.e. compile time. Here the possible O(R**2) machine
- construction cost is eliminated to amortize complexity at run-time and
- shifted to such only if a bad match actually happens. Whew! Fortunately,
- this is not so important for simple r.e. fare, where H. Spencer's regexp()
- works well, but it certainly helps calendar(1).
-
- The catch with lazy eval. is that it slows down simple matching (15-20%
- for /usr/dict/words on VAX), so it hasn't been adopted by System V egrep.
- Note that our egrep, deferring to the underlying one in /usr/bin, doesn't
- care much about these hideous beasts -- it just doesn't do better on them.
- However, [ef]?grep does well by the Kadhafi matcher (eg.shell, again).
-
- Long lines, small alphabets
-
- Finally, a comment on one rapidly burgeoning application area
- where new egrep should not be blindly proscribed -- genome sequencing.
- Though line limits have been raised (to 8192 byte buffers), much of
- GENBANK has no newlines. The code would need modification for scanning
- freestyle. Also, locating ACGT sequences with the current "superimposed BMG"
- over a 4-letter alphabet might actually be worse, but the global homology
- crowd probably uses a >20 letter protein alphabet (for other reasons).
- At any rate, genetic string search generally relies on more sophisticated
- methods such as dynamic programming ala Sankoff/Kruskal.
-
- On the other hand, large alphabets such as Kanji probably help
- performance. As do parallel transfer disks, MACH file mapping, ...
- Your suggestions welcome.
-
- James A. Woods (ames!jaw)
- NASA Ames Research Center
-
- P.S. Preserving author credit, [ef]?grep may be redistributed as you wish.
-